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Abstract 

We propose a family of statistical models for social network evolution 
over time, which represents an extension of Exponential Random Graph 
Models (ERGMs). Many of the methods for ERGMs are readily adapted 
for these models, including maximum likelihood estimation algorithms. We 
discuss models of this type and their properties, and give examples, as well 
as a demonstration of their use for hypothesis testing and classification. We 
believe our temporal ERG models represent a useful new framework for 
modeling time-evolving social networks, and rewiring networks from other 
domains such as gene regulation circuitry, and communication networks. 
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1 Introduction 



The field of social network analysis is concerned with populations of actors, in- 
terconnected by a set of relations (e.g., friendship, communication, etc.). These 
relationships can be concisely described by directed graphs, with one vertex for 
each actor and an edge for each relation between a pair of actors. This network 
representation of a population can provide insight into organizational structures, 
social behavior patterns, emergence of global structure from local dynamics, and 
a variety of other social phenomena. 

There has been increasing demand for flexible statistical models of social net- 
works, for the purposes of scientific exploration and as a basis for practical analy- 
sis and data mining tools. The subject of modeling a static social network has been 
investigated in some depth. For time-invariant networks, represented as a single 
directed or undirected graph, a number of flexible statistical models have been pro- 
posed, in cluding the classic Exponential Random Graph Models (ERGM) and ex 



tensions (Frank and 



Robins and Pattison, 



Strauss, 



1986 



Wasserman and Robins . 



2005 



Snijders . 



20021 : 



2005|) . which are descriptive in na ture, latent space mod- 



els th at aim towards clustering and community discovery (|Handcock and Rafterv 



2007 ). and mixed-membership block models for role discovery (|Airoldi et al. 



20081). Of particular relevance to this paper is the ERGM, which is particularly 



flexible in that it can be customized to capture a wide range of signature con- 
nectivity patterns in the network via user-specified functions representing their 
sufficient statistics. Specifically, if N is some representation of a social network, 
and M is the set of all possible networks in this representation, then the proba- 
bility distribution function for any ERGM can be written in the following general 
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form. 



nN) = ^^ey.^{e'n{N)} 



Here, 6 E M'^, and u : A/" — M'^. Z{0) is a normalization constant, which is typi- 
cally intractable to compute. The u function represents the sufficient statistics for 
the model, and, in a graphical modeling interpretation, can be regarded as a vector 
of clique potentials. The representation for N can vary widely, possibly including 
multiple relation types, valued or binary relations, symmetric or asymmetric re- 
lations, and actor and relation attributes. The most widely studied models of this 
form are for single-relation social networks, in which case N is generally taken to 
be the weight matrix A for the network (sometimes referred to as a sociomatrix), 
where Aij is the strength of directed relation between the i^^ actor and j*^ actor. 

Often one is interested in modeling the evolution of a network over multiple 
sequential observations. For example, one may wish to model the evolution of 
coauthorship networks in a specific community from year to year, trends in the 
evolution of the World Wide Web, or a process by which simple local relationship 
dynamics give rise to global structure. In such dynamic settings, where a time- 
series of observations of the network structure is available, several formalisms 
have been proposed to model t he dynamics of topological changes of such net- 



works over time. For example. 



Snijders 



(|2006|) has proposed a continuous -time 



model of network dynamics, where each observed event represents a single actor 
altering his or her outgoi ng links to optimize an obje ctive function based on local 



neighborhood statistics. 



Robins and Pattison 



(|200l|) have indepedently studied a 



family of models of network dynamics over discrete time steps, quite similar to 
those presented below; in some sense, the present work can be viewed as a further 
exploration of these models, their properties and uses. However, this exploration 
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goes beyond the (|Robins and Pattison , 



200 ih work, in that we explore the statis- 



tical properties of these models, such as their (non)degeneracy tendencies, and 
the quality of fit that these models achieve when applied to real network time 
series data; such properties have not previously been systematically investigated 
for these types of models, though related work has recently been done on static 



ERGMs (IHandcocM . 



20031). We also explore algorithmic issues in calculating the 



MLE estimators and performing hypothesis tests with these models. Furthermore, 
we feel that the added flexibility in the parametrization of these models below 
makes t hem somewhat easier to spe cify and work with, compared to the descrip- 



tion in ( Robins and Pattison , 



20011) . which although quite elegant, requires the 



sufficient statistics to be nondecreasing in the relation indicator variables for the 
network. 

In the following sections, we propose a model family we would like to refer to 
as temporal ERGM, or TERGM, that is capable of modeling network evolution, 
while maintaining the flexibility of a fully general ERGM. Furthermore, these 
models build upon a generic ERGM formalism, so that existing methods devel- 
oped for ERGMs over the past two decades are readily adapted to apply to the 
temporal models as well. We prove that a very general subclass of the TERGM is 
nondegenerate and explain how to calculate their maximum likelihood estimates 
from network data. Furthermore we show that these models can indeed be fitted 
to capture signature dynamic properties of real world evolving networks, and can 
be applied in hypothesis testing, nodal classification, and other applications. 
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2 Discrete Temporal Models 



We begin by describing the basic form of the type of model we study. Specifically, 
one way to simplify a statistical model for evolving social networks is to make a 
Markov assumption on the network from one time step to the next. Specifically, if 
A* is the weight matrix representation of a single-relation social network at time t, 
then we might make the assumption that ^* is independent of A^, . . . , given 
A*~^. Put another way, a sequence of network observations A^,. . . has the 
property that 

A^ . . . , A'\A^) = 7^(^*1 I . . . V{A''\A^). 

With this assumption in mind, we can now set about deciding what the form of the 
conditional PDF 7^(^4*174*"^) should be. Given our Markov assumption, one natu- 
ral way to generalize ERGMs for evolving networks is to assume A^\A^~^ admits 
an ERGM representation. That is, we can specify a function ^ : ]R„xn x IRnxn 
M^, which can be understood as a temporal potential over cliques across two time- 
adjancent networks, and parameter vector e M'^, such that the conditional PDF 
has the following form. 

V{A'\A'-\0) = ^^^^^exp{0'*(A*,>l*-i)} (1) 

We refer to such a model as a TERGM, for Temporal Exponential Random 
Graph Model. Note that specifying the joint distribution requires one to specify a 
distribution over the first network: A^. This can generally be accomplished fairly 
naturally using an ERGM. For simplicity of presentation, we avoid these details 
in subsequent sections by assuming the distribution over this initial network is 
functionally independent of the parameter 0. 
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In particular, we will be especially interested in the special case of these mod- 
els in which 

*(A*,A*-i) = 5^*,,(4,A*-i). (2) 

This form of the temporal potential function represents situations where the con- 
ditional distribution of A* given A^^^ factors over the entries A\j of A*. As we 
will see, such models possess a number of desirable properties. 

2.1 An Example 

To illustrate the expressiveness of this framework, we present the following sim- 
ple example model. For simplicity, assume the weight matrix of the network is 
binary (i.e., an adjacency matrix). Define the following statistics, which represent 
density, stability, reciprocity, and transitivity, respectively. 

ij 

^R{A\A'-')=n 
^T{A\A^-^)=n 

The statistics are each scaled to a constant range (in this case [0,n]) to enhance 
interpretability of the model parameters. The conditional probability mass func- 
tion (HI) is governed by four parameters: 9d controls the density, or the number of 
ties in the network as a whole; 9s controls the stability, or the tendency of a link 
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E 

ijk 



At At-L A 



t-1 At-1 



jk 



E4^'-^, 

ijk 



t-l 
jk 



that does (or does not) exist at time t — 1 to continue existing (or not existing) at 
time t; Or controls the reciprocity, or the tendency of a link from i to j to result 
in a link from j to i at the next time step; and 9t controls the transitivity, or the 
tendency of a tie from i to j and from j to k to result in a tie from i to A; at the next 
time step. The transition probability for this temporal network model can then be 
written as follows. 



2.2 More General Models 

For simplicity, we will only discuss the simple models described above. However, 
one can clearly extend this framework to allow multiple relations in the network, 
actor attributes, relation attributes, longer-range Markov dependencies, or a host 
of other possibilities. Li fact, many of the results below can easily be generalized 
to deal with these types of extensions. 

3 Estimation 

The estimation task for models of the form ([T]) is to use the sequence of observed 
networks, A^^, A^^, . . . , A^^, to find an estimator that is close to the actual pa- 
rameter values in some sensible metric. As with ERGMs, in general the normal- 
izing constant Z could be computationally intractable, often making explicit so- 
lutions of maximum likelihood estimation difficult. However, general techniques 
for MCMC sampling to enable approximate maximum likelihood estimation for 
ERGMs have been studied in some depth and have proven successful for a variety 
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of models dSnijders 



20021) . By a slight modification of these algorithms, we can 



apply the same general techniques as follows. 
Let 

L{e; N\ N\ . . . , N^) =\ogV{N\N\...,N^\N\e), (3) 

where expectations are taken over the random variable N*, the network at time t. 
Note that 

T 

VL(6>; N\...,N^) = {^{N\ N^-^) - M{t, 6)) 



t=2 



and 



J2{M{t,o)M{t,ey-cit,e)) 



t=2 



The expectat ions can be app roximated by Gibbs sampling from the conditional 



distributions (|Snijders . 



2OO2I) . so that we can perform an unconstrained optimiza- 



tion procedure akin to Newton's method: approximate the expectations, update 
parameter values in the direction that increases ([3]), repeat unt il convergence. A 



related algorithm is described by (IGeyer and Thom pson , 



nential families, and variations are given by (|Snijders 



19921) for general expo- 



2002 ) that are tailored for 



ERG models. The following is a simple version of such an estimation algorithm. 
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1 . Randomly initialize 6^^^ 

2. For 2 = 1 up until convergence 

3. Fort = 2, 3,..., T 
Sample ivg, . . . , N'^^ ~ P(N*|A^*-\ 6^"^] 

4) = iSf=i*(A^^?,A^'-^^ 



0(^+1) ^ 0» _ ^-1 ^^^^ ^(AT*, iV*-i) - 



(0 



The choice of B can affect the convergence of this algorithm. Generally, larger 
B values will give more accurate updates, and thus fewer iterations needed until 
convergence. However, in the early stages of the algorithm, precise updates might 
not be necessary if the likelihood function is sufficiently smooth, so that a B that 
grows larger only when more precision is needed may be appropriate. If compu- 

1 less certain) that the algorithrn 



tational resources are limited, it is possible (thoug 

migh t still converge even for small B values (see (|Carreira-Perpignan and HintonL 
2005b for an alternative approach to sampling-based MLE, which seems to remain 
effective for small B values). 



3.1 Product Transition Probabilities 

Although the general case ([B may often require a sampling-based estimation pro- 
cedure such as that given above, it turns out that the special case of Q does not. 
In this case, as long as the functions are computationally tractable, we can 
tractably perform exact updates in Newton's method, rather than approximating 
them with sampling. 
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3.2 Evalutation of Parameter Recovery 

To examine the convergence rate empirically, we display in Figure [U the conver- 
gence of this algorithm on data generated from the example model given in Sec- 
tion 12.11 The simulated data is generated by sampling from the example model 
with randomly generated 6, and the loss is plotted in terms of Euclidean distance 
of the estimator from the true parameters. To generate the initial A^^ network, 
we sample from the pmf exp{0'^(N^ , N^)}. The number of actors n is 
100. The parameters are initialized uniformly in the range [0, 10), except for Oo, 
which is initialized to —56s — 59 r — 59t- This tends to generate networks with 
reasonable densities. The results in Figure [T] represent averages over 10 random 
initial configurations of the parameters and data. In the estimation algorithm used, 
B = 100, but increases to 1000 when the Euclidean distance between parameter 
estimates from the previous two iterations is less than 1. Convergence is defined as 
the Euclidean distance between 0*^*+^^ and 0*^*-' being within 0.1. Since this partic- 
ular model is simple enough for exact calculation of the likelihood and derivatives 
thereof (see above), we also compare against Newton's method with exact updates 
(rather than sampling-based). We can use this to determine how much of the loss 
is due to the approximations being performed, and how much of it is intrinsic to 
the estimation problem. The parameters returned by the sampling-based approx- 
imation are usually almost identical to the MLE obtained by Newton's method, 
and this behavior manifests itself in Figure [Hby the losses being visually indistin- 
guishable. 
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Figure 1 : Convergence of estimation algorithm on simulated data, measured in 
Euclidean distance of the estimated values from the true parameter values. The 
approximate MLE from the sampling-based algorithm is almost identical to the 
MLE obtained by direct optimization. 
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4 (Non)Degeneracy of Temporal ERGMs 



There has been much concern expressed in the literature over certain degeneracy 
issues that arise w hen working with Exponential Random Graph Models. Specif- 



ically, Handcock (IHandcock , 



2003 b has recently been exploring the fact that, for 



many ERGMs, most of the parameter space is populated by distributions that place 
almost all of the probability mass on a small number of networks (typically the 
complete or empty graphs). This leads to several negative effects. For instance, 
since we do not expect the true generating distribution to be degenerate, the preva- 
lence of these degenerate distributions intuitively reflects a mismatch between the 
model and the type of process we wish to capture with it. Additionally, it is often 
the case that degenerate distributions can be found very close to any nondegener- 
ate distribution, so that slight variations in the parameters cause the distribution to 
become degenerate. This also leads to another problem, namely inference degen- 
eracy. Even if the generating distribution is modeled by some parameter values 
with a nondegenerate distribution, the degeneracy of nearby distributions may pre- 
vent commonly used maximum likelihood estimation techniques from converging 
to it within a reasonable sample size; specifically, the aforementioned MCMC 
techniques may r equire an impract ically large number of samples, or may even 



fail to work at all (Handcockl . 



I2003D. 

One natural question to ask is whether such issues also affect these temporal 
extensions. In the simple case where the transition distribution factors over the 
edges, as in (|2l), it turns out these models avoid such problems entirely. 
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4.1 Nondegeneracy of the Example Model 

To keep the initial explanation of this phenomenon as simple as possible, we be- 
gin this discussion by looking at the special case of the example model from Sec- 
tion [lU 

For any given entry Ajj, the networks A^^^ that minimize and maximize the 
probability that Ajj = 1 are the empty graph and complete graph; which one 
maximizes and which one minimizes it depends on the parameter values. If A^~^ 
is the empty graph, then 

t^l\^t-u_ exp{9D/{n-l)} 



exp{6£)/{n — 1)} + exp{9s/ (n — 1)} 
Under A^^^ as the complete graph, it is 



exp{ieD + es + eT + eR)/in~i)} 



expiiOD + es + eT + eR)/in - 1)} + 1 

So the entropy is lower bounded as follows: 

H(A') > min H(A'\A*-^) = mm\" H(A'AA'-^) > min H(A'AA'-^). 

ij ij 

We can lower bound min H{A\j\A^^^) by the quantity 

pin — h (1 — p) In ■ 



p 1 — p 

where p = 



exp{{\6D 




l + l 


Or\ 


+ \er\)/in-l)} 


exp{{\eD\ + \0s\ + \0R\ + \0T\)/{n 


-1)} + 


-exp{- 
1 


-(|^^D| + |^^5| + |^/?| + |^^T|)/(n-l)} 


exp{-2{\eD\ 


+ 1 




+ 


\0R 


| + |^T|)/(n-l)} + l- 
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(p is an upper bound on P(A*^ = and 1 —p is a lower bound on it). So the 

entropy lower bound is at most n{n — l)(pln(l/p) + (1 — p) ln(l/(l — p))), and 
thus as long as lOo] + 16*51 + I^^rI + I^t] is not too large, the entropy is guaranteed 
to be reasonably large. 

Other than the entropy, we can get a somewhat more intuitive grasp of this 
type of nondegeneracy result by bounding the expected number of nonzero entries 
in A*. In particular, a consequence of the above reasoning is that the expected 
number of nonzero entries in A* is at most 

1 

" ^Kxp{-2i\eD\ + \9s\ + \eR\ + lOrD/in - 1)} + 1' 

and is at least 

1 

Tl\Tl — 1 ) 

^ 'exp{2{\eD\ + \es\ + \eR\ + \eT\)/{n - 1)} + 1' 

So again, as long as \9d \ + \0s \ + \dR\ + \9t\ is not too large, we are guaranteed a 
reasonable expected number of nonzero entries in A^. 

To give an example of the types of entropy values one gets from a model of 
this type, in Figure [21 we plot the exact entropy values for the example model as 
a function oi 6d and 6s (with Or = 9t = ^ make a two-dimensional plot 
possible), and as a function of Od and Ot (with Or = 9s = 0); other options, such 
as fixing the unused parameters to nonzero values, yield similar plots. Specifically, 
the plotted values are the entropy of A^, where each is an x n matrix (where 
n = 7 in the left plot, and ri = 6 in the right plot), and is sampled from the 
basic Bernoulli graph model, in which each entry Ajj is an independent Bernoulli 
random variable with probability of being 1 equal to 0.25. 

It is worth briefly mentioning how these plots are generated. Since it is not 
computationally tractable to enumerate all graphs for each parameter setting, we 
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Figure 2: Entropy plots for the example model. In both plots, small magnitudes 
of the parameters give distributions with high entropy, as predicted. 

instead calculate it for equivalence classes of graphs which can be analytically 
shown to have identical probability values, and weight each class according to its 
size in the entropy calculation. For the first plot, since the conditional probability 
of given is only a function of how many edges are present in and how 
many ij values have Afj = Ajj, and since the edges of A^ are exchangeable, 
we can write the marginal distribution of purely in terms of the number of 
edges. Thus we need only calculate n(n — 1) probability values, and the entropy 
is a weighted sum, where the weights are combinatorial quantities reflecting the 
number of graphs with that many edges. For the second plot, the situation is 
more complex but the idea is similar. In this case, we define the equivalence 
classes based purely on graph isomorphisms. The number of distinct isomorphic 
networks of six nodes is 156, a significant reduction from the total number of 
networks, rendering the calculation of entropy computationally tractable. 

In both plots, small magnitudes of the parameters give distributions with high 
entropy, as predicted. 
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4.2 Nondegeneracy Under General Product Transitions 

We can generalize the preceding discussion beyond the simple example model, by 
considering general transition distributions that factor over entries of Ajj as fol- 
lows. Suppose {^k{A^, A*~^)}k is a sequence of functions such that A^~^) 
= Yjij "^ijkiAlj, A*"^) (i.e., satisfying Q), where each ^ijk has range contained 
in [—/3, f3] for some /3 > 0, so that the range of \E'fe is in [—n{n — 1)/?, n{n — l)/3]. 
Then we consider transition models of the form O, with these \E' values. That is. 



Note that these models factor over the entries A\j given A^^^. 
Theorem 4.1. Let 

1 

P 



(4) 



exp{2pj:M} + l 
For models of the form dH), the expected number of nonzero entries in A^ is in the 

range 

[n{n — n{n — 1)(1 — p)] , 
and the entropy can be lower bounded as 

H{A^) > n{n - 1) ( plog- + (1 -p)log^^ ) . 

V P ^-pj 

In particular, as long as 16*^1 is not too large, this bound implies the entropy 
will be reasonably large. 

Sketch. As above, we can upper bound the probability of a particular entry A\j 
taking value 1, given A*^^, by 

1 



exp{-2(3Y.^%\] + V 
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and lower bound it by 



exp{2l3Y.^\ek\} + l 
Since the conditional distribution given A^~^ factors over the edges of A^, the 

expected number of edges given is in this range, multiplied by n{n — 1). 

Since these bounds are independent of A*'~^, they also hold for the expectation 

under the marginal distribution of Similarly, as before we can lower bound 

the entropy under the marginal distribution of A* as 

H{A')>Y,M{A%\A'-\ 

ij 

and due to the aforementioned bounds on the conditional of Ajj, the quantity 

H(Alj\A^-^) is at least 

pin - + (1 —p) In . 

□ 



5 Hypothesis Testing: A Case Study 

As an example of how models of this type might be used in practice, we present 
a simple hypothesis testing application. Here we see the generality of this frame- 
work pay off, as we can use models of this type to represent a broad range of sci- 
entific hypotheses. The general approach to hypothesis testing in this framework 
is first to write down potential functions representing transitions one expects to be 
of some significance in a given population, next to write down potential functions 
representing the usual "background" processes (to serve as a null hypothesis), and 
third to plug these potentials into the model, calculate a test statistic, and compute 
a p-value. 
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The data involved in this example come from the United States 108*'* Senate, 
having n = 100 actors. Every time a proposal is made in the Senate, be it a bill, 
amendment, resolution, etc., a single Senator serves as the proposal's sponsor and 
there may possibly be several cosponsors. Given records of all proposals voted 
on in the full Senate, we create a sliding window of 100 consecutive proposals. 
For a particular placement of the window, we define a binary directed relation 
existing between two Senators if and only if one of them is a sponsor and the 
other a cosponsor for the same proposal within that window (where the direction 
is toward the sponsor). The data is then taken as evenly spaced snapshots of this 
sliding window, being the adjacency matrix for the first 100 proposals, for 
proposal 31 through 130, and so on shifting the window by 30 proposals each 
time. In total, there are 14 observed networks in this series, corresponding to the 
first 490 proposals addressed in the 108*'' Senate. 

In this study, we propose to test the hypothesis that intraparty reciprocity is 
inherently stronger than interparty reciprocity. To formalize this, we use a model 
similar to the example given previously. The main difference is the addition of 
party membership indicator variables. Let Pij = 1 if the i*** and j*** actors are 
in the same political party, and otherwise, and let Pij = 1 — Pij. Define the 
following potential functions, representing stability, intraparty density, interparty 
density u overall reciprocity, intraparty reciprocity, and interparty reciprocity. 



'We split density to intra- and inter-party terms so as to factor out the effects on reciprocity of 
having higher intraparty density. 
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1) 
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^3 
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-n 



V /I* 4*^1 

ij 
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The null hypothesis supposes that the reciprocity observed in this data is the 
result of an overall tendency toward reciprocity amongst the Senators, regardless 
of party. The alternative hypothesis supposes that there is a stronger tendency 
toward reciprocity among Senators within the same party than among Senators 
from different parties. Formally, the transition probability for the null hypothesis 
can be written as 



1 



exp<; Yl of^M'^A'-') 



je{S,WD,BD,R} 
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while the transition probabiUty for the alternative hypothesis can be written as 

,^ ) y je{S,WD,BD,WR,BR} 

For our test statistic, we use the likelihood ratio. To compute this, we compute 
the maximum likelihood estimators for each of these models, and take the ratio of 
the likelihoods. For the null hypothesis, the MLE is 

{eP = 336.2, = -58.0, 9% = -95.0, 9^ = 4.7) 

with likelihood value of e~^°^^-^^. For the alternative hypothesis, the MLE is 

{9^ = 336.0, 9^^^j, = -58.8, = -94.3, 9^^^^, = 4.2, 4^], = 0.03) 

with likelihood value of e~^°^^-^^. The likelihood ratio statistic (null likelihood 
over alternative likelihood) is therefore about 0.0041. Because the null hypothesis 
is composite, determining the p-value of this result is a bit more tricky, since 
we must determine the probability of observing a likelihood ratio at least this 
extreme under the null hypothesis for the parameter values 0^°^ that maximize this 
probability. That is. 



p-value = supVo < ^ < 0.0041 



In general this seems not to be tractable to analytic solution, so we employ a 
genetic algorithm to perform the unconstrained optimization, and approximate 
the probability for each parameter vector by sampling. That is, for each parameter 

vector 0*^"' (for the null hypothesis) in the GA's population on each iteration, we 
sample a large set of sequences from the joint distribution. For each sequence, we 
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compute the MLE under the null hypothesis and the MLE under the alternative 
hypothesis, and then calculate the likelihood ratio and compare it to the observed 
ratio. We calculate the empirical frequency with which the likelihood ratio is at 
most 0.0041 in the set of sampled sequences for each vector 6^'^\ and use this as 
the objective function value in the genetic algorithm. Mutations consist of adding 
Gaussian noise (with variance decreasing on each iteration), and recombination 
is perf ormed as usual. Full details of the algorithm are omitted for brevity (see 



(IMitchell . 



1996h for an introduction to GAs). The resulting approximate p-value 



we obtain by this optimization procedure is 0.024. 

This model is nice in that, because it has the form dJ]), we can compute the 
likelihoods and derivatives thereof analytically. In particular, in models of this 
form, we can compute likelihoods and perform Newton-Raphson optimization di- 
rectly, without the need of sampling-based approximations. However, in general 
this might not be the case. For situations in which one cannot tractably com- 
pute the likelihoods, an alternative possibility is to use bounds on the likelihoods. 
Specifically, one can obtain an upper bound on the likelihood ratio statistic by 
dividing an upper bound on the null likelihood by a lower bound on the alterna- 
tive likelihood. When computing the p-value, one can use a lower bound on the 



ratio by dividing a lower bo und on the null 



alternative likelihood. See (|Opper and Saadl . 



i kelihood by an upper bound on the 



2001 



Wainwright et al. 



2005) for 



examples of how such bounds on the likelihood can be tractably attained, even for 
intractable models. 

In practice, the problem of formulating an appropriate model to encode one's 
hypothesis is not well-posed. One general approach which seems intuitively ap- 
pealing is to write down the types of motifs or patterns one expects to find in 
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the data, and then specify various other patterns which one believes those motifs 
could likely transition to (or would likely not transition to) under the alternative 
hypothesis. For example, perhaps one believes that densely connected regions of 
the network will tend to become more dense and clique-like over time, so that one 
might want to write down a potential representing the transition of, say, k-cliques 
to more densely connected structures. 

6 Classification: A Case Study 

One can additionally consider using these temporal models for classification. 
Specifically, consider a transductive learning problem in which each actor has a 
static class label, but the learning algorithm is only allowed to observe the labels 
of some random subset of the population. The question is then how to use the 
known label information, in conjunction with observations of the network evolv- 
ing over time, to accurately infer the labels of the remaining actors whose labels 
are unknown. 

As an example of this type of application, consider the alternative hypothesis 
model from the previous section (model 1), in which each Senator has a class 
label (party affiliation). We can slightly modify the model so that the party labels 
are no longer constant, but random variables drawn independently from a known 
multinomial distribution. Assume we know the party affiliations of a randomly 
chosen 50 Senators. This leaves 50 Senators with unknown affiliations. If we 
knew the parameters 0, we could predict these 50 labels by sampling from the 
posterior distribution and taking the mode for each label. However, since both 
the parameters and the 50 labels are unknown, this is not possible. Instead, we 
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can perform Expectation Maximization to jointly infer the maximum likelihood 
estimator 6 for 6 and the posterior mode given 0. 

Specifically, let us assume the two class labels are Democrat and Republican, 
and we model these labels as independent Bernoulli(0.5) random variables. The 
distribution on the network sequence given that all 100 labels are fully observed 
is the same as given in the previous section. Since one can compute likelihoods 
in this model, sampling from the posterior distribution of labels given the net- 
work sequence is straightforward using Gibbs sampling. We can therefore em- 
ploy a combination of MCEM an d Generalized EM algorithms (call it MCGEM) 



(McLachlan and Krishnan 



1997|) with this model to infer the party labels as fol- 



lows. In each iteration of the algorithm, we sample from the posterior distribution 
of the unknown class labels under the current parameter estimates given the ob- 
served networks and known labels, approximate the expectation of the gradient 
and Hessian of the log likelihood using the samples, and then perform a single 
Newton-Raphson update using these approximations. 

We run this algorithm on the 108*'' Senate data from the previous section. We 
randomly select 50 Senators whose labels are observable, and take the remaining 
Senators as having unknown labels. As mentioned above, we assume all Sena- 
tors are either Democrat or Republican; Senator Jeffords, the only independent 
Senator, is considered a Democrat in this model. We run the MCGEM algorithm 
described above to infer the maximum likelihood estimator for 0, and then sam- 
ple from the posterior distribution over the 50 unknown labels under that maxi- 
mum likelihood distribution, and take the sample mode for each label to make a 
prediction. 

The predictions of this algorithm are correct on 70% of the 50 Senators with 
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unknown labels. Additionally, it is interesting to note that the parameter values the 
algorithm outputs {6s = 336.0, 6wd = — 59.7, 6bd = —9Q.0,6wb. = 3.8, 6bb. = 
0.28) are very close (Euclidean distance 2.0) to the maximum likelihood estima- 
tor obtained in the previous section (where all class labels were known). Compare 
the above accuracy score with a baseline predictor that always predicts Democrat, 
which would get 52% correct for this train/test split, indicating that this statisti- 
cal model of network evolution provides at least a somewhat reasonable learning 
bias. However, there is clearly room for improvement in the model specifica- 
tion, and it is not clear whether modeling the evolution of the graph is actually 
of any benefit for this particular example. For example, after collapsing this se- 
quence of networks into a single weighted graph with edge weights equal to the 
sum of edge weights over all graphs in the sequence, run ning Thorsten Joachims' 



2003h gives a 90% prediction 



Spectral Graph Transducer algorithm (|JoachimsL 
accuracy on the Senators with unknown labels. These results are summarized in 
Table \T\ Further investigation is needed into what types of problems can bene- 
fit from explicitly modeling the network evolution, and what types of models are 
most appropriate for basing a learning bias on. 



Method 


Accuracy 


Baseline 


52% 


Temporal Model 


70% 


SGT 


90% 



Table 1: Summary of classification results. 
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7 Assessing Statistic Importance and Quality of Fit: 
A Case Study 

In this section, we use TERGMs to model the network transitions of the lOS*'^ 
U.S. Senate network, described in Section[51 The dynamic network has 100 nodes 
and 12 time point j^. We perform two types of experiments here: the first is simply 
to assess which statistics are important for modeling the network transitions, by 
observing the magnitudes of the estimated parameters, and the second assesses 
the quality of fit of a model with a cross-validation experiment. 

We start with including three statistics: Density, Stability, and Reciprocity. 
The estimated parameters are plotted in Figure [3l We have estimated 1 1 sets of 
model parameters, so each box plot in a subplot contains 11 values. Judging by 
the magnitudes of the weights, we can see that Density and Stability play big roles, 
whereas Reciprocity plays a minor role in this case. 
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Figure 3: Estimated parameter values (weights) for a TERGM with 3 statistics 
(features). 



^We have removed the first two time points from the original 14-step series of Section|5] due 
to outher behavior in the initial two time points. This behavior is explained by an initial surge 
in activity when the Senate reconvenes after a vacation, but is not part of the usual "stationary" 
behavior, so we chose to exclude it when evaluating the quality of fit. 
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The 3-statistic TERGM model is so simple that we would not expect it to have 
enough modeling power. Therefore, we introduce some 3-node statistics and 
expand to include 7 statistics in the model. The 4 new statistics are Transitivity, 
Reverse-Transitivity, Co-Supported, and Co-Supporting, which are illustrated in 
Figure [6l Transitivity has been explained in earlier sections. Reverse-Transitivity 
means that if person B supports person C and person C supports person A at time 
(t—l), then it is likely that person A will support person B at time t. Co-Supported 
says that if both A and B are supported by a third person at time t, then it is likely 
that person A will support person B at time t. Co-Supporting is defined similarly. 
In all of the cases, we are looking at the influence from the previous time point on 
the link from A to B at time t. 

More formally, the new statistics are defined as 



v|/c5d(A*,A*-i)=n 



E.t-l At-l At I At-l 
^jk ^ki ^ij I 2-^^jk ^ki 



ijk 



ijk 



^ki ^kj ^ij I Z^^ki ^kj 



ijk 



ijk 



n 



^ik ^jk ^ij I 2-^^ik ^jk 

ijk ijk 



Figure |4] shows the parameter values for the TERGM model with these 7 statis- 
tics. Among the 4 new statistics. Transitivity and Co-Supporting are major con- 
tributors, while Reverse-Transitivity and Co-Supported are neglectable. The ef- 
fectiveness of Transitivity is intuitive, but the big contrast between the weights for 
Co-Supporting and for Co-Supported could be intricate. It can be explained by the 



^We call Density, Stability, and Reciprocity 2-node features because their decomposed forms 
only involve two nodes. 
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nature of the data. Each proposal has a single sponsor and possibly multiple co- 
sponsors. Therefore, each Senator is likely to be a sponsor (supported by others) 
for few proposals, while she or he could potentially be a co-sponsor (supporter) 
for many more proposals. When the Co-Supporting case happens, it is likely that 
the two Senators supported a third Senator on the same proposal, which suggests a 
shared position on the issue, which could further lead to a cooperation on another 
proposal at a later time. In contrast, when the Co-Supported situation happens, it 
is certain that the two Senators are supported by a third senator on different pro- 
posalso Although they are co-sponsored by a same Senator, these proposals can 
be in very different areas, which does not necessarily suggest a common interest 
for the two sponsors. 
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Figure 4: Estimated parameter values (weights) for a TERGM with 7 statistics 
(features). 



Next, we add two more 3-node statistics. Popularity and Generosity, to the 
model to have a set of 9 statistics. As illustrated in Figured Popularity says that 
if one has a supporter, she or he is likely to have another supporter. Generosity 



"^Links corresponding to a proposal are pointing to a single node, since there is only one sponsor 
for each proposal. 
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can be understood in the same manner. More formally, they are defined as 
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Figure 5: Estimated parameter values (weights) for a TERGM with 9 statistics 
(features). 




Transitivity Reverse- Co-Supported Co-Supporting Popularity Generosity 
Transitivity 



Figure 6: Graph illustrations of six 3-node statistics corresponding to Features 
4—9 in Figure |4] and [51 Blue circles are nodes; black solid arrows represent links 
(or a supporting relationship) at time (t — 1); red dotted arrows represent an edge 
at time t. 

Figure |5] shows the estimated parameter values for the TERGM with 9 statis- 
tics. Both Popularity and Generosity have significant weights, while Transitivity 
has a weight with low magnitude. In some sense, the Transitivity statistic can be 
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viewed as Popularity plus Generosity with some constraints. This plot indicates 
that the importance of Popularity and Generosity was the indirect cause of our 
emphasis on Transitivity in the simpler models, since in a model containing these 
statistics, Transitivity becomes irrelevant. 

Next, we heuristically evaluate the quality of fit of the model using a cross- 
validation style experiment, as follows. For each of the time points t (except t = 
1), we estimate a set of parameters for a TERGM to fit all the observed transitions 
except the transition from time point t — 1 to time point t. We then sample a 
number of networks from the conditional distribution over the network at time t 
given the observed network at time t — 1, under the estimated parameters. Finally, 
we compare the 'i!(A^,A^^^) statistic values from the sampled A* networks to 
the "^{A^, y4*~^) statistic values from the true A* network. We repeat this entire 
process for each t > 1 to generate our plots. The results of this comparison reflect 
relatively how well TERGMs model the transitions, given that we are committed 
to a Markov assumption for the transitions. 

Figure |7] presents the comparison of statistic values between ground-truth and 
sampled networks from the estimated TERGMs from the cross-validation process 
described above. For a few statistics, the blue lines lie within the green lines (i.e., 
in the range of sampled networks) for most time points, which means that the 
model does a fairly good job of predicting the change in statistic values: for ex- 
ample. Reciprocity, Reverse-Transitivity, Co-Supported, Co-Supporting, and Pop- 
ularity. It is worth noting that we can even capture some sharp changes with these 
models: for instance, Reverse-Transitivity, Co-Supporting, and Popularity at time 
point 7 (the last of these is particularly dramatic). This is somewhat surprising, 
as it intuitively seems like sharp changes might be quite difficult to predict with a 
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Markov assumption and such simple statistics. On the other hand, not all statistics 
are predicted well, such as Stability, where the predictions are quite poor; as such, 
there is clearly room for improvement in the design of statistics to more accurately 
model time-evolving networks. 

8 Future Work 

If we think of this type of model as describing a process giving rise to the net- 
works one observes in reality, then one can think of a single network observation 
as a snapshot of this Markov chain at that time point. Traditionally one would 
model a network at a single time point using an ERGM. However, in light of the 
degeneracy issues found in ERGMs, and the lack thereof for the temporal models 
with product conditional distributions, it seems worthwhile to investigate mod- 
eling a single network as the end-point of an unobservable sequence. Directly 
modeling this with latent variables would seem to make inference computation- 
ally difficult. However, it may be possible to indirectly model this by studying the 
stationary distribution of these Markov chains. To our knowledge, it remains an 
open problem to directly specify the family of stationary distributions that a given 
TERGM corresponds to. 

Moving forward, we hope to move beyond these ERG-inspired models toward 
models that incorporate latent variables, which may also evolve over time with the 
network. For example, it may often be the case that the phenomena represented 
in data can most easily be described by imagining the existence of unobserved 
groups or factions, which form, dissolve, merge and split as time progresses. The 
flexibility of the ERG models and the above temporal extensions allows a social 
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108-th Senate 



108-th Senate 



108-th Senate 




Figure 7: Statistic values of real networks and sampled networks based on a 
TERGM with 9 statistics. The comparisons are grouped by statistic. Blue solid 
lines indicate the observed (true) network statistics. Box plots are for the sampled 
networks (in the described cross-validation experiments) and green dotted lines 
indicate 5- and 95-percentiles. 

scientist to "plug in" his or her knowledge into the formulation of the model, while 
still providing general-purpose estimation algorithms to find the right trade-offs 
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between competing and complementary factors in the model. We would like to 
retain this flexibility in formulating a general family of models that include evolv- 
ing latent variables in the representation, so that the researcher can "plug in" his 
or her hypotheses about latent group dynamics, evolution of unobservable actor 
attributes, or a range of other possible phenomena into the model representation. 
At the same time, we would like to preserve the ability to provide a "black box" 
inference algorithm to determine the parameter and variable values of interest to 
the researcher, as can be done with ERGMs and now TERGMs. 
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